Search results for "Quantitative Biology - Genomics"

showing 10 items of 10 documents

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

2012

Motivation: The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work, we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets. Results: We first used simulated reads to explore the relationship between the level of compression and the error rate, the leng…

FOS: Computer and information sciences; Statistics and Probability; Burrows–Wheeler transform; Computer science; Data_CODINGANDINFORMATIONTHEORY; Burrows-Wheeler transform; computer.software_genre; Biochemistry; Data Compression; Next-generation sequencing; Computer Science - Data Structures and Algorithms; Escherichia coli; Code (cryptography); Humans; Overhead (computing); Data Structures and Algorithms (cs.DS); Computer Simulation; Quantitative Biology - Genomics; Molecular Biology; Genomics (q-bio.GN); Genome Human; String (computer science); Search engine indexing; Sorting; Genomics; Sequence Analysis DNA; Construct (python library); Computer Science Applications; Computational Mathematics; Computational Theory and Mathematics; FOS: Biological sciences; Data mining; Databases Nucleic Acid; computer; Algorithms; Data compression

researchProduct
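To make the idea in the abstract above concrete, here is a minimal in-memory sketch of the Burrows-Wheeler transform of a single string. It is not the authors' algorithm, which targets very large read collections on moderate hardware; the function names, the `$` sentinel, and the run count as a rough proxy for compressibility are all assumptions made for illustration.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text+sentinel, take the last column."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Naive inversion: repeatedly prepend the last column and re-sort."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The row that ends with the sentinel is the original text plus sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters (fewer runs compress better)."""
    return sum(1 for _ in groupby(s))


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, count_runs(transformed))
    print(inverse_bwt(transformed))
```

The rotation sort costs quadratic memory in the text length, so this only works for toy inputs; it is meant to show what the transform is, not how to compute it at genome scale.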

High-speed and accurate color-space short-read alignment with CUSHAW2

2013

Summary: We present an extension of CUSHAW2 for fast and accurate alignments of SOLiD color-space short-reads. Our extension introduces a double-seeding approach to improve mapping sensitivity, by combining maximal exact match seeds and variable-length seeds derived from local alignments. We have compared the performance of CUSHAW2 to SHRiMP2 and BFAST by aligning both simulated and real color-space mate-paired reads to the human genome. The results show that CUSHAW2 achieves comparable or better alignment quality compared to SHRiMP2 and BFAST at an order-of-magnitude faster speed and significantly smaller peak resident memory size. Availability: CUSHAW2 and all simulated datasets are avail…

Genomics (q-bio.GN); FOS: Biological sciences; Quantitative Biology - Genomics

researchProduct

MetaCache-GPU: Ultra-Fast Metagenomic Classification

2021

The cost of DNA sequencing has dropped exponentially over the past decade, making genomic data accessible to a growing number of scientists. In bioinformatics, localization of short DNA sequences (reads) within large genomic sequences is commonly facilitated by constructing index data structures which allow for efficient querying of substrings. Recent metagenomic classification pipelines annotate reads with taxonomic labels by analyzing their $k$-mer histograms with respect to a reference genome database. CPU-based index construction is often performed in a preprocessing phase due to the relatively high cost of building irregular data structures such as hash maps. However, the rapidly growi…

Genomics (q-bio.GN); FOS: Computer and information sciences; Source code; Computer science; media_common.quotation_subject; Hash function; Context (language use); MinHash; computer.software_genre; Data structure; Hash table; Computer Science - Distributed Parallel and Cluster Computing; FOS: Biological sciences; Preprocessor; Quantitative Biology - Genomics; Distributed Parallel and Cluster Computing (cs.DC); Data mining; computer; media_common; Reference genome; 50th International Conference on Parallel Processing

researchProduct

Variations in Substitution Rate in Human and Mouse Genomes

2003

We present a method to quantify spatial fluctuations of the substitution rate on different length scales throughout genomes of eukaryotes. The fluctuations on large length scales are found to be predominantly a consequence of a coarse-graining effect of fluctuations on shorter length scales. This is verified for both the mouse and the human genome. We also found that both species show a similar standard deviation of fluctuations even though their mean substitution rate differs by a factor of two. Our method furthermore allows us to determine time-resolved substitution rate maps from which we can compute auto-correlation functions in order to quantify how fast the spatial fluctuations in substitu…

Genomics (q-bio.GN); Genome; Models Genetic; Genome Human; Relative standard deviation; Substitution (logic); Autocorrelation; Populations and Evolution (q-bio.PE); Genetic Variation; General Physics and Astronomy; Genomics; Time resolution; Biology; Quantitative Biology::Genomics; Mice; Evolutionary biology; FOS: Biological sciences; Animals; Humans; Quantitative Biology - Genomics; Human genome; Quantitative Biology - Populations and Evolution; Repetitive Sequences Nucleic Acid

researchProduct

Comparing DNA sequence collections by direct comparison of compressed text indexes

2012

Popular sequence alignment tools such as BWA convert a reference genome to an indexing data structure based on the Burrows-Wheeler Transform (BWT), from which matches to individual query sequences can be rapidly determined. However, the utility of also indexing the query sequences themselves remains relatively unexplored. Here we show that an all-against-all comparison of two sequence collections can be computed from the BWT of each collection with the BWTs held entirely in external memory, i.e. on disk and not in RAM. As an application of this technique, we show that BWTs of transcriptomic and genomic reads can be compared to obtain reference-free predictions of splice junctions that have h…

Genomics (q-bio.GN); Sequence; Computer science; business.industry; Search engine indexing; Sequence alignment; Pattern recognition; Construct (python library); Data structure; Burrows-Wheeler Transform; Splice junctions; External memory; FOS: Biological sciences; Code (cryptography); Quantitative Biology - Genomics; Artificial intelligence; business; Auxiliary memory; Reference genome

researchProduct

Inverted Repeats in Viral Genomes

2004

We investigate 738 complete genomes of viruses to detect the presence of short inverted repeats. The number of inverted repeats found is compared with the prediction obtained for a Bernoullian and for a Markovian control model. We find as a statistical regularity that the number of observed inverted repeats is often greater than the one expected in terms of a Bernoullian or Markovian model in several of the viruses and in almost all those with a genome longer than 30,000 bp.

Genomics (q-bio.GN); Statistical Mechanics (cond-mat.stat-mech); Complex system; Inverted repeat; General Mathematics; viral genome; General Physics and Astronomy; FOS: Physical sciences; Computational biology; Biology; Genome; Quantitative Biology - Quantitative Methods; Settore FIS/07 - Fisica Applicata(Beni Culturali Ambientali Biol.e Medicin); stochastic processes; Viral genomes; FOS: Biological sciences; secondary RNA structure; Quantitative Biology - Genomics; Quantitative Methods (q-bio.QM); Condensed Matter - Statistical Mechanics; DNA probabilistic models

researchProduct
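As a rough illustration of the comparison described in the abstract above, the sketch below counts short inverted repeats in a DNA string and computes the count expected under a Bernoullian (independent bases) control model. The definition used here (a stem of fixed length followed, after a loop of at most `max_loop` bases, by its reverse complement) and all function and parameter names are assumptions; the abstract does not state the authors' exact definition, and the Markovian control model is not covered.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def count_inverted_repeats(seq: str, stem: int, max_loop: int) -> int:
    """Count (start, loop) placements where a stem is followed by its reverse complement."""
    n = len(seq)
    count = 0
    for i in range(n - 2 * stem + 1):
        target = reverse_complement(seq[i:i + stem])
        for loop in range(max_loop + 1):
            j = i + stem + loop
            if j + stem > n:
                break
            if seq[j:j + stem] == target:
                count += 1
    return count


def expected_bernoulli(seq: str, stem: int, max_loop: int) -> float:
    """Expected count if bases are drawn independently with the observed frequencies."""
    n = len(seq)
    freq = {base: seq.count(base) / n for base in "ACGT"}
    # Probability that two independent bases are complementary.
    q = sum(freq[b] * freq[c] for b, c in zip("ACGT", "TGCA"))
    total = 0.0
    for loop in range(max_loop + 1):
        placements = n - 2 * stem - loop + 1
        if placements > 0:
            total += placements * q ** stem
    return total


if __name__ == "__main__":
    genome = "GAATTCGCGAATTC"  # toy sequence, not real data
    print(count_inverted_repeats(genome, stem=3, max_loop=4))
    print(expected_bernoulli(genome, stem=3, max_loop=4))
```

Comparing the observed count with the expectation is the kind of excess the abstract reports; a significance test would additionally need the variance of the count, which this sketch does not estimate.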

Statistical properties of thermodynamically predicted RNA secondary structures in viral genomes

2008

By performing a comprehensive study on 1832 segments of 1212 complete genomes of viruses, we show that in viral genomes the hairpin structures of thermodynamically predicted RNA secondary structures are more abundant than expected under a simple random null hypothesis. The detected hairpin structures of RNA secondary structures are present both in coding and in noncoding regions for the four groups of viruses categorized as dsDNA, dsRNA, ssDNA and ssRNA. For all groups hairpin structures of RNA secondary structures are detected more frequently than expected for a random null hypothesis in noncoding rather than in coding regions. However, potential RNA secondary structures are also present i…

Genomics (q-bio.GN); inverted repeat; bioinformatic; RNA; statistical physics; Computational biology; Biology; Condensed Matter Physics; Genome; Quantitative Biology - Quantitative Methods; Electronic Optical and Magnetic Materials; RNA silencing; Viral genomes; FOS: Biological sciences; Coding region; Quantitative Biology - Genomics; Quantitative Methods (q-bio.QM)

researchProduct

Inverted and mirror repeats in model nucleotide sequences

2007

We analytically and numerically study the probabilistic properties of inverted and mirror repeats in model sequences of nucleic acids. We consider both perfect and non-perfect repeats, i.e. repeats with mismatches and gaps. The considered sequence models are independent identically distributed (i.i.d.) sequences, Markov processes and long range sequences. We show that the number of repeats in correlated sequences is significantly larger than in i.i.d. sequences and that this discrepancy increases exponentially with the repeat length for long range sequences.

Independent identically distributed; Time Factors; Molecular Sequence Data; Markov process; Nucleic Acid Denaturation; Quantitative Biology - Quantitative Methods; Combinatorics; symbols.namesake; Exponential growth; Chromosomes Human inverted repeats; Nucleotide; Quantitative Biology - Genomics; RNA Small Interfering; Quantitative Methods (q-bio.QM); Sequence (medicine); Mathematics; Probability; Repetitive Sequences Nucleic Acid; Genomics (q-bio.GN); chemistry.chemical_classification; Stochastic Processes; Models Statistical; Base Sequence; Nucleotides; Probabilistic logic; Markov Chains; chemistry; FOS: Biological sciences; Nucleic acid; symbols; Nucleic Acid Renaturation; Nucleic Acid Conformation; Algorithms; Physical review. E, Statistical, nonlinear, and soft matter physics

researchProduct

Adaptive reference-free compression of sequence quality scores

2014

Motivation: Rapid technological progress in DNA sequencing has stimulated interest in compressing the vast datasets that are now routinely produced. Relatively little attention has been paid to compressing the quality scores that are assigned to each sequence, even though these scores may be harder to compress than the sequences themselves. By aggregating a set of reads into a compressed index, we find that the majority of bases can be predicted from the sequence of bases that are adjacent to them and hence are likely to be less informative for variant calling or other applications. The quality scores for such bases are aggressively compressed, leaving a relatively small number at full reso…

Statistics and Probability; FOS: Computer and information sciences; Computer science; media_common.quotation_subject; Reference-free; computer.software_genre; Biochemistry; DNA sequencing; Set (abstract data type); Redundancy (information theory); BWT; Computer Science - Data Structures and Algorithms; Code (cryptography); Animals; Humans; Quality (business); Data Structures and Algorithms (cs.DS); Quantitative Biology - Genomics; Caenorhabditis elegans; Molecular Biology; media_common; Genomics (q-bio.GN); Sequence; Genome; Settore INF/01 - Informatica; reference-free compression; High-Throughput Nucleotide Sequencing; Genomics; Sequence Analysis DNA; Data Compression; compression; Computer Science Applications; Computational Mathematics; Computational Theory and Mathematics; FOS: Biological sciences; Data mining; quality score; Metagenomics; computer; Algorithms; Reference genome

researchProduct

Lightweight LCP construction for next-generation sequencing datasets

2012

The advent of "next-generation" DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix array (LCP) of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-efficient algorithms for computing the LCP of a string have been described in the literature, but require the presence in RAM of large data structures. This prevents such methods from being feasible for NGS datasets. In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP and B…

Whole genome sequencing; Genomics (q-bio.GN); FOS: Computer and information sciences; Sequence; BWT; LCP; next-generation sequencing datasets; BWT LCP text indexes next-generation sequencing datasets massive datasets; Settore INF/01 - Informatica; Computer science; Computation; String (computer science); LCP array; Parallel computing; Data structure; DNA sequencing; Substring; FOS: Biological sciences; Computer Science - Data Structures and Algorithms; Quantitative Biology - Genomics; Data Structures and Algorithms (cs.DS)

researchProduct
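For readers unfamiliar with the longest common prefix (LCP) array named in the abstract above, the following sketch builds it for one short string by sorting suffixes and comparing neighbours. It is a textbook in-memory construction, not the lightweight sequential-scan method the paper proposes; the function names and the `$` terminator are assumptions.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text, in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text: str, sa: list[int]) -> list[int]:
    """lcp[r] = length of the common prefix of the suffixes ranked r-1 and r; lcp[0] = 0."""
    lcp = [0] * len(sa)
    for rank in range(1, len(sa)):
        a, b = text[sa[rank - 1]:], text[sa[rank]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[rank] = k
    return lcp


if __name__ == "__main__":
    text = "banana$"
    sa = suffix_array(text)
    print(sa)
    print(lcp_array(text, sa))
```

Both steps hold every suffix in memory, which is exactly the RAM requirement the abstract says rules out such approaches for next-generation sequencing datasets.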